Abstract

Background: In this study we consider DNA sequences as mathematical strings. Total and reduced alignments
between two DNA sequences have been considered in the literature as measures of their similarity, and explicit
representations have already been obtained for some of these alignments.

Results: We present exact, explicit and computable formulas for the number of different possible alignments 
between two DNA sequences and a new formula for a class of reduced alignments. 

Conclusions: A unified approach for a wide class of alignments between two DNA sequences has been provided. 
The formula is computable and, if complemented by software development, will provide a deeper insight into the 
theory of sequence alignment and give rise to new comparison methods. 
Background

Let us consider a DNA sequence as a mathematical string

$$x = (x_1, x_2, \ldots, x_n),$$

where $x_i \in \{A, G, C, T\}$ is one of the four nucleotides, $i = 1, 2, \ldots, n$; here A denotes
adenine, C cytosine, G guanine and T thymine. Under these conditions, the sequence $x$ has length $n$.

Our main goal is to compare the sequence $x$ with another DNA sequence

$$y = (y_1, y_2, \ldots, y_m),$$

to measure the similarity between both strings and also to determine their residue-residue correspondences.

Sequence comparison and alignment is a central and crucial tool in molecular biology. For example,
Pairwise Sequence Alignment is used to identify regions of similarity that may indicate functional,
structural and/or evolutionary relationships between two biological sequences (protein or nucleic acid) [1].

For some recent developments and directions we refer the reader to [2-7], and to [8] for a general
review of different alignment methods.

To align the sequences CGT and ACTT, one can use EMBOSS Needle for nucleotide sequences [9], which
creates an optimal global alignment of the two sequences using the Needleman-Wunsch algorithm, to get

    EMBOSS_001  1 - C G T  3
                    | . |
    EMBOSS_001  1 A C T T  4
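The global alignment above can be illustrated with a minimal Needleman-Wunsch sketch. This is an illustration only: the function name `needleman_wunsch` and the scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions made here for simplicity, and the sketch does not reproduce the scoring matrix or the gap opening and extension penalties used by EMBOSS Needle.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment of x and y with a simple linear gap penalty (assumed scoring)."""
    n, m = len(x), len(y)

    def sub(i, j):
        # Substitution score for aligning x[i-1] with y[j-1].
        return match if x[i - 1] == y[j - 1] else mismatch

    # score[i][j] = best score of aligning the prefixes x[:i] and y[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # residue-residue column
                score[i - 1][j] + gap,            # gap in y
                score[i][j - 1] + gap,            # gap in x
            )

    # Trace back one optimal path from the bottom-right corner.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return score[n][m], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    s, ax, ay = needleman_wunsch("CGT", "ACTT")
    print(s)
    print(ax)
    print(ay)
```

Under these assumed scores the sketch aligns CGT with ACTT by placing a single gap in front of C, the same correspondence as in the alignment shown above.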

Following Lesk [10], in order to compare the amino acids appearing at corresponding positions in
two sequences, their correspondences must be assigned; a sequence alignment is the identification
of these residue-residue correspondences. For some references on sequence alignment we refer the
reader to [10-16].

To compare two sequences, there exist mainly three different possibilities, leading to three
different numbers of total alignments [10,11,13]:

1. The total number of alignments, denoted by $f(n, m)$, which was determined in [13].

2. A gap in one sequence is followed by a gap in the other sequence, as in Alignments 1 and 2 for
   the sequences $x$ = CGT and $y$ = ACTT (see Tables 1 and 2 below).

   Considering these two alignments as equivalent to Alignment 3 (see Table 3), which has no gap in
   those positions, we obtain the number of reduced alignments, denoted by $h(n, m)$, and obviously
   $h(n, m) < f(n, m)$. This case has been solved in [11], and we give here another representation
   in terms of hypergeometric series.

3. In the interesting case that Alignments 1 and 2 are equivalent to each other but different from
   Alignment 3, we have a number of reduced alignments $g(n, m)$, where

   $$h(n, m) < g(n, m) < f(n, m).$$

   This last case is new and we present an explicit formula for $g$.

Results and discussion 
Number of f(x, y) alignments

The total number of alignments $f(n, m)$ between sequences $x$ and $y$ of lengths $n$ and $m$
satisfies the following recurrence relation [13]

$$f(n, m) = f(n-1, m) + f(n, m-1) + f(n-1, m-1),$$

with initial conditions $f(n, 0) = f(0, m) = 1$ for $n, m = 1, 2, 3, \ldots$. The solution of the
above partial difference equation is given by

$$f(n, m) = \sum_{k=0}^{\min\{n,m\}} 2^k \binom{n}{k} \binom{m}{k}$$

(see formula (10) in [13]) and the generating function [17,18] is

$$F(x, y) = \frac{1}{1 - x - y - xy}.$$

[Table 2: Alignment 2]

Therefore the coefficients $f(n, m)$ in the expansion

$$F(x, y) = \sum_{n=0}^{\infty} \sum_{m=0}^{\infty} f(n, m)\, x^n y^m$$

are given in terms of a hypergeometric series by

$$f(n, m) = {}_2F_1(-m, -n; 1; 2).$$
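The three descriptions of $f(n, m)$ given so far (the recurrence, the finite sum, and the terminating ${}_2F_1$ series) can be cross-checked numerically. The following is a minimal sketch; the function names `f_rec`, `f_sum` and `f_hyp` are hypothetical, and the value $f(0, 0) = 1$ is assumed from the generating function.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def f_rec(n, m):
    """f(n, m) from the recurrence, with f(n, 0) = f(0, m) = 1 (f(0, 0) = 1 assumed)."""
    if n == 0 or m == 0:
        return 1
    return f_rec(n - 1, m) + f_rec(n, m - 1) + f_rec(n - 1, m - 1)


def f_sum(n, m):
    """f(n, m) from the finite sum over k of 2^k * C(n, k) * C(m, k)."""
    return sum(2 ** k * comb(n, k) * comb(m, k) for k in range(min(n, m) + 1))


def f_hyp(n, m):
    """Terminating series 2F1(-m, -n; 1; 2), summed term by term in exact integers."""
    total, term, k = 0, 1, 0
    while term != 0:
        total += term
        # Ratio of consecutive terms: (k - m)(k - n) * 2 / (k + 1)^2; the division is exact.
        term = term * (k - m) * (k - n) * 2 // ((k + 1) ** 2)
        k += 1
    return total


if __name__ == "__main__":
    # Lengths of the example sequences x = CGT (n = 3) and y = ACTT (m = 4).
    print(f_rec(3, 4), f_sum(3, 4), f_hyp(3, 4))
```

For the example sequences CGT and ACTT all three functions return 129 total alignments.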

This relation seems to be new in this form. Here, the 
generalized hypergeometric series is defined as (see e.g. 
[19, Chapter 16]) 

$${}_pF_q(a_1, \ldots, a_p; b_1, \ldots, b_q; z) = \sum_{k=0}^{\infty} \frac{(a_1)_k (a_2)_k \cdots (a_p)_k}{(b_1)_k (b_2)_k \cdots (b_q)_k} \frac{z^k}{k!},$$

and $(A)_k = A(A+1) \cdots (A+k-1)$, with $(A)_0 = 1$, denotes the Pochhammer symbol. It is
assumed that $b_j \neq -k$ in order to avoid singularities in the denominators. If one of the
parameters $a_j$ equals a negative integer, then the sum becomes a terminating series.

[Alignment table for x = CGT (top row) and y = ACTT (bottom row)]

Number of h(x, y) alignments

In this case, the recurrence relation for the $h(n, m)$ coefficients is [11]

$$h(n, m) = h(n-1, m) + h(n, m-1) - h(n-2, m-2), \qquad n, m \geq 2,$$

with initial conditions $h(n, 0) = h(0, m) = 1$. Therefore, the generating function [17,18] is

$$H(x, y) = \frac{1 - xy}{x^2 y^2 - x - y + 1},$$

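and the coefficients in the expansion

$$H(x, y) = \sum_{n=0}^{\infty} \sum_{m=0}^{\infty} h(n, m)\, x^n y^m$$

are given by

$$h(n, m) = \sum_{i=0}^{A} \frac{(-1)^i (-3i + m + n)!}{i!\, (m - 2i)!\, (n - 2i)!} - \sum_{i=0}^{B} \frac{(-1)^i (-3i + m + n - 2)!}{i!\, (-2i + m - 1)!\, (-2i + n - 1)!},$$

where

$$A = \min\left\{ \left[\frac{m}{2}\right], \left[\frac{n}{2}\right] \right\}, \qquad B = \min\left\{ \left[\frac{m-1}{2}\right], \left[\frac{n-1}{2}\right] \right\}.$$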



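The above coefficients can be written in terms of (terminating) hypergeometric series as

$$h(n, m) = \frac{(m+n)!}{m!\, n!}\; {}_4F_3\!\left( \frac{1-m}{2}, -\frac{m}{2}, \frac{1-n}{2}, -\frac{n}{2};\; \frac{-m-n}{3}, \frac{-m-n+1}{3}, \frac{-m-n+2}{3};\; \frac{16}{27} \right)$$

$$\qquad\quad - \frac{(m+n-2)!}{(m-1)!\, (n-1)!}\; {}_4F_3\!\left( \frac{1-m}{2}, 1-\frac{m}{2}, \frac{1-n}{2}, 1-\frac{n}{2};\; \frac{-m-n+2}{3}, \frac{-m-n+3}{3}, \frac{-m-n+4}{3};\; \frac{16}{27} \right).$$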
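The representations of $h(n, m)$ above can be compared numerically with the coefficients obtained directly from the generating function. The sketch below is only a consistency check under stated assumptions: the names `h_table`, `h_sum`, `h_hyp` and `hyper_terminating` are hypothetical, boundary values such as $h(1, m)$ are taken from the generating function $H(x, y)$ rather than from the recurrence alone, and the second term of each formula is taken to be zero when $n = 0$ or $m = 0$.

```python
from fractions import Fraction
from math import comb, factorial


def h_table(N, M):
    """h(n, m) by comparing coefficients in (x^2 y^2 - x - y + 1) H(x, y) = 1 - xy."""
    h = [[0] * (M + 1) for _ in range(N + 1)]

    def at(n, m):
        return h[n][m] if n >= 0 and m >= 0 else 0

    for n in range(N + 1):
        for m in range(M + 1):
            rhs = int((n, m) == (0, 0)) - int((n, m) == (1, 1))
            h[n][m] = rhs + at(n - 1, m) + at(n, m - 1) - at(n - 2, m - 2)
    return h


def h_sum(n, m):
    """Difference of the two finite sums; empty ranges stand for vanishing terms."""
    first = sum(
        (-1) ** i * factorial(n + m - 3 * i)
        // (factorial(i) * factorial(m - 2 * i) * factorial(n - 2 * i))
        for i in range(min(m // 2, n // 2) + 1)
    )
    second = sum(
        (-1) ** i * factorial(n + m - 3 * i - 2)
        // (factorial(i) * factorial(m - 2 * i - 1) * factorial(n - 2 * i - 1))
        for i in range(min((m - 1) // 2, (n - 1) // 2) + 1)
    )
    return first - second


def hyper_terminating(a, b, z):
    """Exact sum of a pFq series; assumes some a_j is a non-positive integer."""
    total, term, k = Fraction(0), Fraction(1), 0
    while True:
        total += term
        num = Fraction(1)
        for aj in a:
            num *= aj + k
        if num == 0:  # all later terms vanish
            return total
        den = Fraction(k + 1)
        for bj in b:
            den *= bj + k
        term *= num * z / den
        k += 1


def h_hyp(n, m):
    """h(n, m) from the two terminating 4F3 series with argument 16/27."""
    z = Fraction(16, 27)
    a1 = [Fraction(1 - m, 2), Fraction(-m, 2), Fraction(1 - n, 2), Fraction(-n, 2)]
    b1 = [Fraction(-m - n + t, 3) for t in (0, 1, 2)]
    value = comb(m + n, n) * hyper_terminating(a1, b1, z)
    if n >= 1 and m >= 1:
        a2 = [Fraction(1 - m, 2), Fraction(2 - m, 2), Fraction(1 - n, 2), Fraction(2 - n, 2)]
        b2 = [Fraction(-m - n + t, 3) for t in (2, 3, 4)]
        value -= comb(m + n - 2, n - 1) * hyper_terminating(a2, b2, z)
    return value


if __name__ == "__main__":
    print(h_table(4, 4)[3][4], h_sum(3, 4), h_hyp(3, 4))
```

Exact rational arithmetic (`fractions.Fraction`) is used for the hypergeometric series so that the comparison with the integer-valued recurrence is not affected by rounding.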



[Table 1: Alignment 1]

[Table 3: Alignment 3]

Number of g(x, y) alignments

As indicated before, the main aim of this paper is to give an explicit representation in this
case. The recurrence relation for the $g(n, m)$ coefficients is [11]

$$g(n, m) = g(n-1, m-1) + g(n-1, m) + g(n, m-1) - 2 g(n-2, m-2), \qquad n, m \geq 2,$$

with initial conditions $g(n, 0) = g(0, m) = 1$. Thus, the generating function [17,18] is

$$G(x, y) = \frac{1 - xy}{2x^2 y^2 - xy - x - y + 1}. \tag{1}$$



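Theorem 1. The coefficients $a_{n,m}$ in the expansion

$$G(x, y) = \frac{1 - xy}{2x^2 y^2 - xy - x - y + 1} = \sum_{n=0}^{\infty} \sum_{m=0}^{\infty} a_{n,m}\, x^n y^m \tag{2}$$

are explicitly given by

$$a_{n,m} = \left( \sum_{i=U(n,m)}^{n+m} \; \sum_{j=A(i,n,m)}^{B(i,n,m)} \beta_{i,j,n,m} \right) - \left( \sum_{i=U(n,m)-1}^{n+m-2} \; \sum_{j=C(i,n,m)}^{D(i,n,m)} \gamma_{i,j,n,m} \right), \tag{3}$$

where

$$\beta_{i,j,n,m} = \frac{(-1)^{i-j}\, 2^{i-j}\, i!}{(i-j)!\, (2i-j-m)!\, (2i-j-n)!\, (3j-4i+m+n)!}, \tag{4}$$

$$\gamma_{i,j,n,m} = \frac{(-1)^{i-j}\, 2^{i-j}\, i!}{(i-j)!\, (2i-j-m+1)!\, (2i-j-n+1)!\, (3j-4i+m+n-2)!}, \tag{5}$$

$$A(i, n, m) = \max\left\{ 0, \left\lceil \frac{4i - m - n}{3} \right\rceil \right\}, \tag{6}$$

$$B(i, n, m) = \min\left\{ i,\; 2i - m,\; 2i - n,\; \left[ \frac{4i - n - m}{2} \right] \right\}, \tag{7}$$

$$C(i, n, m) = \max\left\{ 0, \left\lceil \frac{4i - m - n + 2}{3} \right\rceil \right\}, \tag{8}$$

$$D(i, n, m) = \min\left\{ i,\; 2i - m + 1,\; 2i - n + 1,\; \left[ \frac{4i - n - m + 2}{2} \right] \right\}, \tag{9}$$

$$U(n, m) = \begin{cases} m - [n/2], & n \leq m, \\ [(m+1)/2] + n - m, & n > m, \end{cases} \tag{10}$$

and $[x]$ denotes the integer part of $x$, while $\lceil x \rceil$ denotes the smallest integer not less than $x$.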



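Proof. If we expand

$$G(x, y) = (1 - xy) \sum_{i=0}^{\infty} (x + y + xy - 2x^2 y^2)^i = (1 - xy) \sum_{i=0}^{\infty} \left( \sum_{j=0}^{i} \left( \sum_{k=0}^{j} \left( \sum_{s=0}^{k} (-1)^{i-j} 2^{i-j} \binom{i}{j} \binom{j}{k} \binom{k}{s}\, y^{2i-j-s} x^{2i-j-k+s} \right) \right) \right), \tag{11}$$

we have two summands to be computed, namely

$$\sum_{i=0}^{\infty} \left( \sum_{j=0}^{i} \left( \sum_{k=0}^{j} \left( \sum_{s=0}^{k} (-1)^{i-j} 2^{i-j} \binom{i}{j} \binom{j}{k} \binom{k}{s}\, y^{2i-j-s} x^{2i-j-k+s} \right) \right) \right) \tag{12}$$

and

$$\sum_{i=0}^{\infty} \left( \sum_{j=0}^{i} \left( \sum_{k=0}^{j} \left( \sum_{s=0}^{k} (-1)^{i-j} 2^{i-j} \binom{i}{j} \binom{j}{k} \binom{k}{s}\, y^{2i-j-s+1} x^{2i-j-k+s+1} \right) \right) \right), \tag{13}$$

the second of which is subtracted from the first.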



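In order to compute the first sum (12) let us introduce

$$m = 2i - j - s, \qquad n = 2i - j - k + s. \tag{14}$$

Therefore, the summation to be done reads as

$$\sum_{n=0}^{\infty} \sum_{m=0}^{\infty} \left( \sum_{i=U}^{V} \sum_{j=A}^{B} (-1)^{i-j} 2^{i-j} \binom{i}{j} \binom{j}{k} \binom{k}{s} \right) x^n y^m,$$

where, by (14), $s = 2i - j - m$ and $k = 4i - 2j - m - n$, and where $U$, $V$, $A$ and $B$ must be
computed in terms of the initial indices.
The product of binomials can be simplified to

$$\binom{i}{j} \binom{j}{k} \binom{k}{s} = \frac{i!}{(i-j)!\, (2i-j-m)!\, (2i-j-n)!\, (3j-4i+m+n)!}.$$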
Thus,

$$i \geq 0, \quad j \geq 0, \quad 4i - 2j - m - n \geq 0, \quad 2i - j - m \geq 0, \quad i - j \geq 0, \quad 2i - j - n \geq 0, \quad 3j - 4i + m + n \geq 0,$$

and then

$$A(i, n, m) = A = \max\left\{ 0, \left\lceil \frac{4i - m - n}{3} \right\rceil \right\} \leq j \leq \min\left\{ i,\; 2i - m,\; 2i - n,\; \left[ \frac{4i - n - m}{2} \right] \right\} = B(i, n, m) = B.$$
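Finally, the summation reads as

$$\sum_{n=0}^{\infty} \sum_{m=0}^{\infty} \left( \sum_{i=U(n,m)}^{n+m} \sum_{j=A}^{B} \frac{(-1)^{i-j}\, 2^{i-j}\, i!}{(i-j)!\, (2i-j-m)!\, (2i-j-n)!\, (3j-4i+m+n)!} \right) x^n y^m,$$

where

$$U(n, m) = \begin{cases} m - [n/2], & n \leq m, \\ [(m+1)/2] + n - m, & n > m. \end{cases}$$

Similar work with the second summand (13) leads to the final result. □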
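As a small sanity check of Theorem 1 (a worked instance added for illustration, not part of the proof), take $n = m = 1$. Then $U(1, 1) = 1$, so the first double sum in (3) runs over $i = 1, 2$ and the second over $i = 0$ only, and in each case the bounds (6)-(9) leave a single admissible value of $j$:

$$
\begin{aligned}
i = 1:&\quad A = B = 1, & \beta_{1,1,1,1} &= \frac{(-1)^0\, 2^0\, 1!}{0!\, 0!\, 0!\, 1!} = 1, \\
i = 2:&\quad A = B = 2, & \beta_{2,2,1,1} &= \frac{(-1)^0\, 2^0\, 2!}{0!\, 1!\, 1!\, 0!} = 2, \\
i = 0:&\quad C = D = 0, & \gamma_{0,0,1,1} &= \frac{(-1)^0\, 2^0\, 0!}{0!\, 0!\, 0!\, 0!} = 1.
\end{aligned}
$$

Hence $a_{1,1} = (1 + 2) - 1 = 2$. This agrees with a direct expansion of (1): the coefficient of $xy$ in $\sum_i (x + y + xy - 2x^2y^2)^i$ is $1 + 2 = 3$ (one from the term $xy$ and two from $(x + y)^2$), and the factor $(1 - xy)$ subtracts $1$.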

Some numerical values are $g(10, 10) = 2003204$, $g(50, 50) = 2.71972 \times 10^{34}$,
$g(100, 100) = 7.55997 \times 10^{69}$, and we note that $g(n, n) > 10^{80}$ for $n \geq 115$.
This last inequality is relevant since $10^{80}$ is an estimate of the number of protons in the
universe [13].
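The explicit formula of Theorem 1 can be checked against the coefficients obtained directly from the generating function (1). The following sketch is one possible way to do so; the names `g_table`, `beta`, `U` and `g_formula` are hypothetical, boundary values such as $g(1, m)$ are taken from (1) rather than from the recurrence alone, lower summation bounds are rounded up, and terms in which a factorial argument would be negative are treated as zero.

```python
from math import factorial


def g_table(N, M):
    """g(n, m) by comparing coefficients in (2x^2y^2 - xy - x - y + 1) G(x, y) = 1 - xy."""
    g = [[0] * (M + 1) for _ in range(N + 1)]

    def at(n, m):
        return g[n][m] if n >= 0 and m >= 0 else 0

    for n in range(N + 1):
        for m in range(M + 1):
            rhs = int((n, m) == (0, 0)) - int((n, m) == (1, 1))
            g[n][m] = (rhs + at(n - 1, m) + at(n, m - 1) + at(n - 1, m - 1)
                       - 2 * at(n - 2, m - 2))
    return g


def beta(i, j, n, m):
    """Summand (4); returns 0 when any factorial argument is negative (assumed convention)."""
    args = (i - j, 2 * i - j - m, 2 * i - j - n, 3 * j - 4 * i + m + n)
    if min(args) < 0:
        return 0
    den = 1
    for a in args:
        den *= factorial(a)
    return (-2) ** (i - j) * factorial(i) // den  # exact: a signed multinomial coefficient


def U(n, m):
    """Lower limit (10) of the outer sum."""
    return m - n // 2 if n <= m else (m + 1) // 2 + n - m


def ceil_div(a, b):
    """Smallest integer not less than a / b, for b > 0."""
    return -((-a) // b)


def g_formula(n, m):
    """a_{n,m} of Theorem 1; the summand (5) equals (4) evaluated at (n - 1, m - 1)."""
    first = 0
    for i in range(U(n, m), n + m + 1):
        lo = max(0, ceil_div(4 * i - m - n, 3))
        hi = min(i, 2 * i - m, 2 * i - n, (4 * i - n - m) // 2)
        first += sum(beta(i, j, n, m) for j in range(lo, hi + 1))
    second = 0
    for i in range(max(U(n, m) - 1, 0), n + m - 1):
        lo = max(0, ceil_div(4 * i - m - n + 2, 3))
        hi = min(i, 2 * i - m + 1, 2 * i - n + 1, (4 * i - n - m + 2) // 2)
        second += sum(beta(i, j, n - 1, m - 1) for j in range(lo, hi + 1))
    return first - second


if __name__ == "__main__":
    table = g_table(10, 10)
    # Compare the two computations of g(10, 10) with the value quoted in the text.
    print(table[10][10], g_formula(10, 10))
```

Because Python integers have arbitrary precision, the same functions can be evaluated at larger arguments such as $g(100, 100)$ for comparison with the values quoted above.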

Conclusions 

A unified approach for a wide class of alignments between two DNA sequences has been provided.
Our approach also gives an explicit formula that fills a gap in the theory of sequence alignment.
The formula is computable and, if complemented by software development, will provide a deeper
insight into the theory of sequence alignment and give rise to new comparison methods. In the
future, it may also be used to obtain explicit formulas for, and to compute, the number of total,
reduced, and effective alignments for multiple sequences.

Methods 

We have performed a number of numerical computations to compare our formulae with the
Mathematica® [20] command Coefficient applied to the series expansion of (1), on a MacBook Pro
featuring a 45 nm "Penryn" 2.66 GHz Intel "Core 2 Duo" processor (P8800), with two independent
processor "cores" on a single silicon chip, and 8 GB of 1066 MHz DDR3 SDRAM (PC3-8500). Our
approach is considerably faster: for example, g(100, 100) is computed in Mathematica® in
0.125165 seconds using the new formulas presented in this paper, while the Mathematica® command
Coefficient needs 99.167659 seconds.